Abstract

Learning a distinct representation for each sense of an ambiguous word could lead
to more powerful and fine-grained models of vector-space representations. Yet
while ‘multi-sense’ methods have been proposed and tested on artificial
word-similarity tasks, we don’t know if they improve real natural language
understanding tasks. In this paper we introduce a multi-sense embedding model
based on Chinese Restaurant Processes that achieves state of the art performance
on matching human word similarity judgments, and propose a pipelined architecture
for incorporating multi-sense embeddings into language understanding.

We then test the performance of our model on part-of-speech tagging, named entity
recognition, sentiment analysis, semantic relation identification and semantic
relatedness, controlling for embedding dimensionality. We find that multi-sense
embeddings do improve performance on some tasks (part-of-speech tagging, semantic
relation identification, semantic relatedness) but not on others (named entity
recognition, various forms of sentiment analysis). We discuss how these
differences may be caused by the different role of word sense information in each
of the tasks. The results highlight the importance of testing embedding models in
real applications.


1 Introduction

Enriching vector models of word meaning so they can represent multiple word
senses per word type seems to offer the potential to improve many language
understanding tasks. Most traditional embedding models associate each word type
with a single embedding (e.g., 3)). Thus the embedding for homonymous words like
bank (with senses including ‘sloping land’ and ‘financial institution’) is forced
to represent some uneasy central tendency between the various meanings. More
fine-grained embeddings that represent more natural regions in semantic space
could thus improve language understanding.

Early research pointed out that embeddings could model aspects of word sense
(Kintsch, 2001) and recent research has proposed a number of models that
represent each word type by different senses, each sense associated with a
sense-specific embedding (Kintsch, 2001; Reisinger and Mooney, 2010; Neelakantan
et al., 2014; Huang et al., 2012; Chen et al., 2014; Pina and Johansson, 2014; Wu
and Giles, 2015; Liu et al., 2015). Such sense-specific embeddings have shown
improved performance on simple artificial tasks like matching human word
similarity judgments: WS353 (Rubenstein and Goodenough, 1965) or MC30 (Huang et
al., 2012).

Incorporating multi-sense word embeddings into general NLP tasks requires a
pipelined architecture that addresses three major steps:

1. Sense-specific representation learning: learn word sense specific embeddings
from a large corpus, either unsupervised or aided by external resources like
WordNet.

2. Sense induction: given a text unit (a phrase, sentence, document, etc.), infer
word senses for its tokens and associate them with corresponding sense-specific
embeddings.

3. Representation acquisition for phrases or sentences: learn representations
for text units given sense-specific embeddings and pass them to machine learning
classifiers.

Most existing work on multi-sense embeddings emphasizes the first step by
learning sense-specific embeddings, but does not explore the next two steps.
These are important steps, however, since it isn’t clear how existing multi-sense
embeddings can be incorporated into and benefit real-world NLU tasks.

We propose a pipelined architecture to address all three steps and apply it to a
variety of NLP tasks: part-of-speech tagging, named entity recognition, sentiment
analysis, semantic relation identification and semantic relatedness. We find:

• Multi-sense embeddings give improved performance in some tasks (e.g., semantic
similarity for words and sentences, semantic relation identification,
part-of-speech tagging), but not others (e.g., sentiment analysis, named entity
extraction). In our analysis we offer some suggested explanations for these
differences.

• Some of the improvements for multi-sense embeddings are no longer visible when
using more sophisticated neural models like LSTMs, which have more flexibility in
filtering away the informational chaff from the wheat.

• It is important to carefully compare against embeddings of the same
dimensionality.

• When doing so, the most straightforward way to yield better performance on
these tasks is just to increase embedding dimensionality.

After describing related work, we introduce the new unsupervised sense-learning
model in Section 3, give our sense-induction algorithm in Section 4, and then in
the following sections evaluate its performance on word similarity and on various
NLP tasks.

2 Related Work 

Neural embedding learning frameworks represent each token with a dense vector
representation, optimized through predicting neighboring words or decomposing
co-occurrence matrices (3; Collobert and Weston, 2008; Mnih and Hinton, 2007;
Mikolov et al., 2013; Mikolov et al., 2010; Pennington et al., 2014). Standard
neural models represent each word with a single unique vector representation.

Recent work has begun to augment the neural paradigm to address the multi-sense
problem by associating each word with a series of sense-specific embeddings. The
central idea is to augment standard embedding learning models like skip-grams by
disambiguating word senses based on local co-occurrence: e.g., the fruit “apple”
tends to co-occur with the words “cider”, “tree” and “pear”, while the
homophonous IT company co-occurs with words like “iphone”, “Google” or “ipod”.

For example, Reisinger and Mooney (2010) and Huang et al. (2012) propose ways to
develop multiple embeddings per word type by pre-clustering the contexts of each
token to create a fixed number of senses for each word, and then relabeling each
word token with the clustered sense before learning embeddings. Neelakantan et
al. (2014) extend these models by relaxing the assumption that each word must
have a fixed number of senses and using a non-parametric model that sets a
threshold to decide when a new sense cluster should be split off; Liu et al.
(2015) learn sense/topic specific embeddings by combining neural frameworks with
LDA topic models. Wu and Giles (2015) disambiguate sense embeddings from
Wikipedia by first clustering wiki documents. Chen et al. (2014) turn to external
resources and use a predefined inventory of senses, building a distinct
representation for every sense defined by the WordNet dictionary. Other relevant
work includes Qiu et al. (2014), who maintain separate representations for
different part-of-speech tags of the same word.

Recent work is mostly evaluated on the relatively artificial task of matching
human word similarity judgments.

3 Learning Sense-Specific Embeddings 

We propose to build on this previous literature, most specifically Huang et al.
(2012) and Neelakantan et al. (2014), to develop an algorithm for learning
multiple embeddings for each word type, each embedding corresponding to a
distinct induced word sense. Such an algorithm should have the property that a
word should be associated with a new sense vector just when evidence in the
context (e.g., neighboring words, document-level co-occurrence statistics)
suggests that it is sufficiently different from its earlier senses. Such a line
of thinking naturally points to Chinese Restaurant Processes (CRP) (Blei et al.,
2004; Teh et al., 2006), which have been applied in the related field of word
sense induction. In the analogy of CRP, the current word could either sit at one
of the existing tables (belonging to one of the existing senses) or choose a new
table (a new sense). The decision is made by measuring semantic relatedness
(based on local context information and global document information) and the
number of customers already sitting at that table (the popularity of word
senses). We propose such a model and show that it improves over the state of the
art on a standard word similarity task.

3.1 Chinese Restaurant Processes 

We offer a brief overview of Chinese Restaurant Processes in this section;
readers interested in more details can consult the original papers (Blei et al.,
2004; Teh et al., 2006; Pitman, 1995). CRP can be viewed as a practical
interpretation of Dirichlet Processes (Ferguson, 1973) for non-parametric
clustering. In the analogy, each data point is compared to a customer in a
restaurant. The restaurant has a series of tables $t$, each of which serves a
dish $d_t$. This dish can be viewed as the index of a cluster or a topic. The
next customer $w$ to enter would either choose an existing table, sharing the
dish (cluster) already served, or choose a new cluster based on the following
probability distribution:

$$\Pr(t_w = t) \propto \begin{cases} N_t \, P(w \mid d_t) & \text{if } t \text{ already exists} \\ \gamma \, P(w \mid d_{new}) & \text{if } t \text{ is new} \end{cases} \qquad (1)$$

where $N_t$ denotes the number of customers already sitting at table $t$ and
$P(w \mid d_t)$ denotes the probability of assigning the current data point to
cluster $d_t$. $\gamma$ is the hyperparameter controlling the preference for
sitting at a new table.

CRPs exhibit a useful “rich get richer” property because they take into account
the popularity of different word senses. They are also more flexible than a
simple threshold strategy for setting up new clusters, due to the robustness
introduced by adopting the relative ratio of $P(w \mid d_t)$ and
$P(w \mid d_{new})$.

3.2 Incorporating CRP into Distributed Language Models

We describe how we incorporate CRP into a standard distributed language model^1.

^1 We omit details about training standard distributed models; see Collobert and
Weston (2008) and Mikolov et al. (2013).

As in the standard vector-space model, each token $w$ is associated with a $K$
dimensional global embedding $e_w$. Additionally, it is associated with a set of
senses $Z_w = \{z_w^1, z_w^2, \ldots, z_w^{|Z_w|}\}$ where $|Z_w|$ denotes the
number of senses discovered for word $w$. Each sense $z$ is associated with a
distinct sense-specific embedding $e_w^z$. When we encounter a new token $w$ in
the text, at the first stage, we maximize the probability of seeing the current
token given its context as in standard language models using the global vector
$e_w$:

$$p(e_w \mid e_{neigh}) = F(e_w, e_{neigh}) \qquad (2)$$

$F(\cdot)$ can take different forms in different learning paradigms, e.g.,
$F = \prod_{w' \in neigh} p(e_w, e_{w'})$ for skip-gram or
$F = p(e_w, g(e_{neigh}))$ for SENNA (Collobert and Weston, 2008) and CBOW, where
$g(e_{neigh})$ denotes a function that projects the concatenation of neighboring
vectors to a vector with the same dimension as $e_w$ for SENNA and the
bag-of-word averaging for CBOW (Mikolov et al., 2013).

Unlike traditional one-word-one-vector frameworks, $e_{neigh}$ includes sense
information in addition to the global vectors for neighbors. $e_{neigh}$ can
therefore be written as^2

$$e_{neigh} = \{e_{n-k}, \ldots, e_{n-1}, e_{n+1}, \ldots, e_{n+k}\} \qquad (3)$$

Next we would use CRP to decide which sense the current occurrence corresponds
to, or construct a new sense if it is a new meaning that we have not encountered
before. Based on CRP, the probability of assigning the current occurrence to each
of the discovered senses or to a new sense is given by:

$$\Pr(z_w = z) \propto \begin{cases} N_z^w \, P(e_w^z \mid \text{context}) & \text{if } z \text{ already exists} \\ \gamma \, P(w \mid z_{new}) & \text{if } z \text{ is new} \end{cases} \qquad (4)$$

where $N_z^w$ denotes the number of times token $w$ has already been assigned to
sense $z$. $P(e_w^z \mid \text{context})$ denotes the probability that the
current occurrence belongs to (or is generated by) sense $z$.

The algorithm for parameter update for the one-token predicting procedure is
illustrated in Figure 1: Line 2 shows parameter updating through predicting the
occurrence of the current token. Lines 4-6 illustrate the situation when a new
word sense is detected, in which case we would add the newly detected sense $z$
into $Z_{w_n}$. The vector representation for the newly detected sense would be
obtained by maximizing the function $p(e_w^z \mid \text{context})$.

01: Input: Token sequence $\{w_n, w_{neigh}\}$.
02: Update parameters involved in Eq. (3)(4) based on current word prediction.
03: Sample sense label $z$ from CRP.
04: If a new sense label $z$ is sampled:
05:   - add $z$ to $Z_{w_n}$
06:   - $e_{w_n}^z = \arg\max p(w_n \mid z_m)$
07: else: update parameters involved based on sampled sense label $z$.

Figure 1: Incorporating CRP into Neural Language Models.

^2 For models that predict succeeding words, sense labels for preceding words
have already been decided. For models that predict words using both left and
right contexts, the labels for right-context words have not been decided yet. In
such cases we just use the global word vector to fill up the position.

As we can see, the model performs word-sense clustering and embedding learning
jointly, each one affecting the other. The prediction of the global vector of the
current token (line 2) is based on both the global and sense-specific embeddings
of its neighbors, which will in turn be updated through predicting the current
token. Similarly, once the sense label is decided (line 7), the model will adjust
the embeddings for neighboring words, both global word vectors and sense-specific
vectors.
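
The control flow of Figure 1 can be sketched as a toy training step. This is only an illustration of the ordering of the steps, not the training procedure used here: the gradient updates are replaced by simple moves toward the averaged context vector, a new sense is initialized from the context instead of by an argmax, and the class name `ToySenseModel` and all of its parameters are hypothetical.

```python
import numpy as np


class ToySenseModel:
    """Toy control flow for Figure 1; the update rules are placeholders."""

    def __init__(self, dim, gamma=0.1, lr=0.05, seed=0):
        self.dim, self.gamma, self.lr = dim, gamma, lr
        self.rng = np.random.default_rng(seed)
        self.global_vecs = {}   # word -> global embedding e_w
        self.sense_vecs = {}    # word -> list of sense embeddings e_w^z
        self.sense_counts = {}  # word -> list of counts N_z^w

    def _global(self, word):
        if word not in self.global_vecs:
            self.global_vecs[word] = self.rng.normal(scale=0.1, size=self.dim)
            self.sense_vecs[word], self.sense_counts[word] = [], []
        return self.global_vecs[word]

    def step(self, word, neighbors):
        # Line 01: the input is the current token and its neighbors.
        context = np.mean([self._global(n) for n in neighbors], axis=0)
        e_w = self._global(word)
        # Line 02: placeholder update pulling the global vector toward the context.
        e_w += self.lr * (context - e_w)
        # Line 03: sample a sense label from CRP weights in the style of Eq. (4).
        senses, counts = self.sense_vecs[word], self.sense_counts[word]
        lik = [1.0 / (1.0 + np.exp(-(s @ context))) for s in senses]
        weights = np.array([c * l for c, l in zip(counts, lik)] + [self.gamma * 0.5])
        z = int(self.rng.choice(len(weights), p=weights / weights.sum()))
        if z == len(senses):
            # Lines 04-06: a new sense, initialized from the context (assumption).
            senses.append(context.copy())
            counts.append(1)
        else:
            # Line 07: update the sampled sense vector and its count.
            senses[z] += self.lr * (context - senses[z])
            counts[z] += 1
        return z
```

The first occurrence of a word always opens its first sense, since no existing sense competes with the new-sense option; later occurrences either reinforce an existing sense or open another one.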

4 Obtaining Word Representations for 
NLU tasks 

Next we describe how we decide sense labels for tokens in context. The scenario
is treated as an inference procedure for sense labels where all global word
embeddings and sense-specific embeddings are kept fixed.

Given a document or a sentence, we have an objective function with respect to
sense labels by multiplying Eq. 2 over each token it contains. Computing the
global optimum sense labeling, in which every word gets an optimal sense label,
requires searching over the space of all senses for all words, which can be
expensive. We therefore chose two simplified heuristic approaches:

• Greedy Search: Assign each token the locally optimum sense label and represent
the current token with the embedding associated with that sense.

• Expectation: Compute the probability of each possible sense for the current
word, and represent the word with the expectation vector:

$$e_w = \sum_{z \in Z_w} p(w \mid z, \text{context}) \cdot e_w^z$$

Model        Dataset        SCWS Correlation
SkipGram     1.1B (wiki)    64.6
SG+Greedy    1.1B (wiki)    66.4
SG+Expect    1.1B (wiki)    67.0
SkipGram     120B           66.4
SG+Greedy    120B           69.1
SG+Expect    120B           69.7

Table 1: Performance of different sets of multi-sense embeddings (300d)
evaluated on SCWS by measuring the Spearman correlation between each model’s
similarity and the human judgments.
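
The two inference strategies above can be contrasted in a short sketch. The sense probabilities are an assumption made for illustration: they are computed as a softmax over dot products between each sense vector and a context vector, which stands in for whatever probabilities the trained model provides. The function names are hypothetical.

```python
import numpy as np


def sense_posteriors(sense_vecs, context_vec):
    """Softmax over dot products; a stand-in for the model's sense probabilities."""
    scores = np.asarray(sense_vecs, dtype=float) @ np.asarray(context_vec, dtype=float)
    scores = scores - scores.max()  # shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()


def greedy_vector(sense_vecs, context_vec):
    """Greedy: return the embedding of the single most probable sense."""
    probs = sense_posteriors(sense_vecs, context_vec)
    return np.asarray(sense_vecs, dtype=float)[int(np.argmax(probs))]


def expectation_vector(sense_vecs, context_vec):
    """Expectation: return the probability-weighted average of all sense embeddings."""
    probs = sense_posteriors(sense_vecs, context_vec)
    return probs @ np.asarray(sense_vecs, dtype=float)
```

Greedy commits to one sense and discards the rest, while Expectation keeps every sense in the representation but down-weights the unlikely ones.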

5 Word Similarity Evaluation 

We evaluate our embeddings by comparing with other multi-sense embeddings on the
standard artificial task for matching human word similarity judgments.

Early work used similarity datasets like WS353 (Finkelstein et al., 2001) or RG
(Rubenstein and Goodenough, 1965), whose context-free nature makes them a poor
evaluation. We therefore adopt Stanford’s Contextual Word Similarities (SCWS)
(Huang et al., 2012), in which human judgments are associated with pairs of words
in context. Thus for example “bank” in the context of “river bank” would have low
relatedness with “deficit” in the context “financial deficit”.

We trained our models on two datasets: a Wikipedia dataset comprising 1.1 billion
tokens, and a large dataset combining Wikipedia, Gigaword and Common Crawl,
comprising 120 billion tokens. We iterate over the dataset 3 times, with window
size 11. We next use the Greedy or Expectation strategies to obtain word vectors
for tokens given their context. These vectors are then used as input to compute
the cosine similarity between two words.
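
The evaluation measure can be sketched as follows: cosine similarity between the two in-context word vectors, then Spearman correlation between the model similarities and the human ratings. The Spearman helper below is a simplification that assumes no tied values; a library routine that handles ties would be preferable in practice. The function names are hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of ranks (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])
```

Because the score depends only on ranks, a model is rewarded for ordering word pairs the way human raters do, not for matching the scale of their ratings.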

Performances are reported in Table 1. Consistent with earlier work (e.g.,
Neelakantan et al. (2014)), we find that multi-sense embeddings result in better
performance in the context-dependent SCWS task (SG+Greedy and SG+Expect are
better than SG). As expected, performance is not as high when global level
information is ignored when choosing word senses (SG+Greedy) as when it is
included (SG+Expect), as neighboring words don’t provide sufficient information
for word sense disambiguation. SG+Expect yields a +2.4 performance boost over the
one-word-one-vector strategy on the 1.1 billion token Wikipedia dataset^3 and
+3.2 on the Common Crawl dataset.

Visualization Table 2 shows examples of semantically related words given the
local context. Word embeddings for tokens are obtained by using the inferred
sense labels from the Greedy model and are then used to search for nearest
neighbors in the vector space based on cosine similarity. Like earlier models
(e.g., Neelakantan et al. (2014)), the model can disambiguate different word
senses (in examples like bank, rock and apple) based on their local context;
although of course the model is also capable of dealing with polysemy, i.e.,
senses that are less distinct.

6 Experiments on NLP Tasks 

Having shown that multi-sense embeddings improve word similarity tasks, we turn
to ask whether they improve real-world NLU tasks: POS tagging, NER tagging,
sentiment analysis at the phrase and sentence level, semantic relationship
identification and sentence-level semantic relatedness. For each task, we
experimented on the following sets of embeddings, which are trained using the
word2vec package on the same corpus:

• Standard one-word-one-vector embeddings from skip-gram (50d).

• Sense disambiguated embeddings from Sections 3 and 4 using Greedy Search and
Expectation (50d).

• The concatenation of global word embeddings and sense-specific embeddings
(100d).

• Standard one-word-one-vector skip-gram embeddings with dimensionality doubled
(100d) (100d is the correct corresponding baseline since the concatenation above
doubles the dimensionality of word vectors).

• Embeddings with very high dimensionality (300d).

^3 Neelakantan et al. (2014) reported a result of 69.3 on the SCWS dataset
trained from a Wikipedia corpus, outperforming the model proposed in this paper
trained on a similar-size corpus, although different Wikipedia dumps and
preprocessing techniques are adopted.
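
The dimensionality bookkeeping behind these settings can be made concrete with a small sketch. The helper name `concat_global_sense` is hypothetical; the only point is that concatenating a 50d global vector with a 50d sense-specific vector yields a 100d input, which is why the 100d single-vector embeddings are the matching baseline.

```python
import numpy as np


def concat_global_sense(global_vec, sense_vec):
    """Stack a global and a sense-specific vector into one input vector."""
    return np.concatenate([np.asarray(global_vec, dtype=float),
                           np.asarray(sense_vec, dtype=float)])
```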

As far as possible we try to perform an apples-to-apples comparison on these
tasks, and our goal is an analytic one (to investigate how well semantic
information can be encoded in multi-sense embeddings and how they can improve NLU
performance) rather than an attempt to create state-of-the-art results. Thus for
example, in tagging tasks (e.g., NER, POS), we follow the protocols in Collobert
et al. (2011), using the concatenation of neighboring embeddings as input
features rather than treating embeddings as auxiliary features which are fed into
a CRF model along with other manually developed features as in Pennington et al.
(2014). Or for experiments on sentiment and other tasks where sentence level
embeddings are required we only employ standard recurrent or recursive models for
sentence embedding rather than models with sophisticated state-of-the-art methods
(e.g., Tai et al. (2015); Irsoy and Cardie (2014)).

Significance testing for comparing models is done via the bootstrap test (Efron
and Tibshirani, 1994). Unless otherwise noted, significance testing is performed
on one-word-one-vector embedding (50d) versus multi-sense embedding using
Expectation inference (50d) and one-vector embedding (100d) versus Expectation
(100d).
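
One common form of bootstrap comparison can be sketched as below. The exact variant used for the reported p-values is not specified here, so this paired resampling scheme is an assumption: test examples are resampled with replacement, and the p-value is approximated by the fraction of resamples in which system B fails to beat system A. The function name and parameters are hypothetical.

```python
import numpy as np


def paired_bootstrap_pvalue(correct_a, correct_b, n_resamples=10000, seed=0):
    """Approximate one-sided p-value that system B is not better than system A.

    `correct_a` and `correct_b` hold per-example scores (e.g., 1 for a correct
    prediction, 0 otherwise) for the same test examples in the same order.
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    rng = np.random.default_rng(seed)
    # each row is one resampled test set, drawn with replacement
    idx = rng.integers(0, len(a), size=(n_resamples, len(a)))
    diffs = b[idx].mean(axis=1) - a[idx].mean(axis=1)
    return float(np.mean(diffs <= 0.0))
```

Resampling the same indices for both systems keeps the comparison paired, so examples that are hard for both systems do not inflate the variance of the difference.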

6.1 The Tasks 

Named Entity Recognition We use the CoNLL-2003 English benchmark for training,
and test on the CoNLL-2003 test data. We follow the protocols in Collobert et al.
(2011), using the concatenation of neighboring embeddings as input to a
multi-layer neural model. We employ a five-layer neural architecture, comprised
of an input layer, three convolutional layers with rectifier linear activation
function and a softmax output layer. Training is done by gradient descent with
minibatches where each sentence is treated as one batch. Learning rate, window
size, number of hidden units of hidden layers, L2 regularization and number of
iterations are tuned on the development set.
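
The input representation for the tagging models, a concatenation of neighboring embeddings, can be sketched as follows. Zero-padding at sentence boundaries and the default window size are assumptions made for illustration; the function name `window_features` is hypothetical.

```python
import numpy as np


def window_features(token_vecs, window=2):
    """Concatenate each token's vector with `window` neighbors on each side.

    Positions beyond the sentence boundary are zero-padded (an assumption).
    Returns an array of shape (n_tokens, (2 * window + 1) * dim).
    """
    vecs = np.asarray(token_vecs, dtype=float)
    n, d = vecs.shape
    padded = np.vstack([np.zeros((window, d)), vecs, np.zeros((window, d))])
    return np.stack([padded[i:i + 2 * window + 1].reshape(-1) for i in range(n)])
```

In the multi-sense settings, the rows of `token_vecs` would be the Greedy or Expectation vectors inferred for each token instead of simple table lookups.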

Part-of-Speech Tagging We use Sections 0-18 of the Wall Street Journal (WSJ) data
for training, sections 19-21 for validation and sections 22-24 for testing.
Similar to NER, we trained 5-layer neural models which take the concatenation of
neighboring embeddings as inputs. We adopt a similar training and parameter
tuning strategy as for NER.

Context                                               Nearest Neighbors
Apple is a kind of fruit.                             pear, cherry, mango, juice, peach, plum, fruit, cider, apples, tomato, orange, bean, pie
Apple releases its new ipads.                         microsoft, intel, dell, ipad, macintosh, ipod, iphone, google, computer, imac, hardware
He borrowed the money from banks.                     banking, credit, investment, finance, citibank, currency, assets, loads, imf, hsbc
along the shores of lakes, banks of rivers            land, coast, river, waters, stream, inland, area, coasts, shoreline, shores, peninsula
Basalt is the commonest volcanic rock.                boulder, stone, rocks, sand, mud, limestone, volcanic, sedimentary, pelt, lava, basalt
Rock is the music of teenage rebellion.               band, pop, bands, song, rap, album, jazz, blues, singer, hip-pop, songs, guitar, musician

Table 2: Nearest neighbors of words given context. The embeddings from context
words are first inferred with the Greedy strategy; nearest neighbors are computed
by cosine similarity between word embeddings. Similar phenomena have been
observed in earlier work (Neelakantan et al., 2014).

Model               Accuracy
Standard (50)       0.852
Greedy (50)         0.852 (+0)
Expectation (50)    0.854 (+0.02)
Standard (100)      0.867
Global+G (100)      0.866 (-0.01)
Global+E (100)      0.871 (+0.04)
Standard (300)      0.882

Table 3: Accuracy for Different Models on Named Entity Recognition. Global+E
stands for Global+Expectation inference and Global+G stands for Global+Greedy
inference. P-value 0.223 for Standard (50) versus Expectation (50) and 0.310 for
Standard (100) versus Expectation (100).


Model               Accuracy
Standard (50)       0.925
Greedy (50)         0.934 (+0.09)
Expectation (50)    0.938 (+0.13)
Standard (100)      0.940
Global+G (100)      0.946 (+0.06)
Global+E (100)      0.952 (+0.12)
Standard (300)      0.954

Table 4: Accuracy for Different Models on Part of Speech Tagging. P-value 0.033
for 50d and 0.031 for 100d.


Sentence-level Sentiment Classification (Pang)

The sentiment dataset of Pang et al. (2002) consists of movie reviews with a
sentiment label for each sentence. We divide the original dataset into training
(8101)/dev (500)/testing (2000). Word embeddings are initialized using the
aforementioned types of embeddings and kept fixed in the learning procedure.
Sentence level embeddings are obtained by using standard sequence recurrent
neural models (Pearlmutter, 1989) (for details, please refer to the Appendix
section). The obtained embedding is then fed into a sigmoid classifier.
Convolutional matrices at the word level are randomized from [-0.1, 0.1] and
learned from sequence models. For training, we adopt AdaGrad with mini-batches.
Parameters (i.e., L2 penalty, learning rate and mini-batch size) are tuned on the
development set. Due to space limitations, we omit details of recurrent models
and training.


Model               Accuracy
Standard (50)       0.750
Greedy (50)         0.752 (+0.02)
Expectation (50)    0.750 (+0.00)
Standard (100)      0.768
Global+G (100)      0.765 (-0.03)
Global+E (100)      0.763 (-0.05)
Standard (300)      0.774

Table 5: Accuracy for Different Models on Sentiment Analysis (Pang et al.’s
dataset). P-value 0.442 for 50d and 0.375 for 100d.


Sentiment Analysis-Stanford Treebank The Stanford Sentiment Treebank (Socher et
al., 2013) contains gold-standard labels for each constituent in the parse tree
(phrase level), thus allowing us to investigate a sentiment task at a finer
granularity than the dataset in Pang et al. (2002) where labels are only found at
the top of each sentence. The sentences in the treebank were split into a
training (8544)/development (1101)/testing (2210) dataset.

Following Socher et al. (2013) we obtained embeddings for tree nodes by using a
recursive neural network model, where the embedding for a parent node is obtained
in a bottom-up fashion based on its children. The embeddings for each parse tree
constituent are output to a softmax layer; see Socher et al. (2013).

We focus on the standard version of recursive neural models. Again we fixed word
embeddings to each of the different embedding settings described above^4.
Similarly, we adopted AdaGrad with mini-batches. Parameters (i.e., L2 penalty,
learning rate and mini-batch size) are tuned on the development set. The number
of iterations is treated as a variable to tune and parameters are harvested based
on the best performance on the development set.

^4 Note that this is different from the settings used in (Socher et al., 2013)
where word vectors were treated as parameters to optimize.


Model               Accuracy
Standard (50)       0.818
Greedy (50)         0.815 (-0.03)
Expectation (50)    0.820 (+0.02)
Standard (100)      0.838
Global+G (100)      0.840 (+0.02)
Global+E (100)      0.838 (+0.00)
Standard (300)      0.854

Table 6: Accuracy for Different Models on Sentiment Analysis (binary
classification on Stanford Sentiment Treebank). P-value 0.250 for 50d and 0.401
for 100d.

Semantic Relationship Classification SemEval-2010 Task 8 (Hendrickx et al., 2009)
is to find semantic relationships between pairs of nominals, e.g., in “My
[apartment]e1 has a pretty large [kitchen]e2” classifying the relation between
[apartment] and [kitchen] as component-whole. The dataset contains 9 ordered
relationships, so the task is formalized as a 19-class classification problem,
with directed relations treated as separate labels; see Hendrickx et al. (2009)
for details.

We follow the recursive implementations defined in Socher et al. (2012). The path
in the parse tree between the two nominals is retrieved, and the embedding is
calculated based on recursive models and fed to a softmax classifier. For the
purpose of a pure comparison, we only use embeddings as features and do not
explore other combinations of artificial features. We adopt the same training
strategy as for the sentiment task (e.g., AdaGrad, mini-batches, etc.).


Model               Accuracy
Standard (50)       0.748
Greedy (50)         0.760 (+0.12)
Expectation (50)    0.762 (+0.14)
Standard (100)      0.770
Global+G (100)      0.782 (+0.12)
Global+E (100)      0.778 (+0.18)
Standard (300)      0.798

Table 7: Accuracy for Different Models on Semantic Relationship Identification.
P-value 0.017 for 50d and 0.020 for 100d.


Sentence Semantic Relatedness We use the Sentences Involving Compositional
Knowledge (SICK) dataset (Marelli et al., 2014) consisting of 9927 sentence
pairs, split into training (4500)/development (500)/testing (4927). Each sentence
pair is associated with a gold-standard label ranging from 1 to 5, indicating how
semantically related the two sentences are, from 1 (the two sentences are
unrelated) to 5 (the two are very related).

In our setting, the similarity between two sentences is measured based on
sentence-level embeddings. Let $S_1$ and $S_2$ denote two sentences and $e_{s_1}$
and $e_{s_2}$ denote the corresponding embeddings. $e_{s_1}$ and $e_{s_2}$ are
obtained through recurrent or recursive models (as illustrated in the Appendix
section). Again, word embeddings are obtained by simple table look up in
one-word-one-vector settings and inferred using the Greedy or Expectation
strategy in multi-sense settings. We adopt two different recurrent models for
acquiring sentence-level embeddings, a standard recurrent model and an LSTM model
(Hochreiter and Schmidhuber, 1997).

The similarity score is predicted using a regression model built on the structure
of a three layer convolutional model, with the concatenation of $e_{s_1}$ and
$e_{s_2}$ as input, and a regression score from 1 to 5 as output. We adopted the
same training strategy as described earlier. The trained model is then used to
predict the relatedness score between two new sentences. Performance is measured
using Pearson’s r between the predicted score and gold-standard labels.
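
For reference, Pearson’s r between predicted and gold relatedness scores can be computed as in the sketch below. The function name `pearson_r` is hypothetical, and this is the textbook formula rather than code from the experiments.

```python
import numpy as np


def pearson_r(pred, gold):
    """Pearson correlation between predicted and gold-standard scores."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    pc, gc = pred - pred.mean(), gold - gold.mean()
    return float((pc @ gc) / np.sqrt((pc @ pc) * (gc @ gc)))
```

Unlike the Spearman correlation used for word similarity, Pearson’s r is sensitive to the linear relationship between the scores, not just to their ranking.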


Model               Pearson’s r
Standard (50)       0.824
Greedy (50)         0.838 (+0.14)
Expectation (50)    0.836 (+0.12)
Standard (100)      0.835
Global+G (100)      0.840 (+0.05)
Global+E (100)      0.845 (+0.10)
Standard (300)      0.850

Table 8: Pearson’s r for Different Models on Semantic Relatedness for Standard
Models. P-value 0.028 for 50d and 0.042 for 100d.


Model               Pearson’s r
Standard (50)       0.843
Greedy (50)         0.848 (+0.05)
Expectation (50)    0.846 (+0.03)
Standard (100)      0.850
Global+G (100)      0.853 (+0.03)
Global+E (100)      0.854 (+0.04)
Standard (300)      0.850

Table 9: Pearson’s r for Different Models on Semantic Relatedness for LSTM
Models. P-value 0.145 for 50d and 0.170 for 100d.


6.2 Discussions 

Results for different tasks are presented in Tables 3-9.

At first glance it seems that multi-sense embeddings do indeed offer superior
performance, since combining global vectors with sense-specific vectors
introduces a consistent performance boost for every task, when compared with the
standard (50d) setting. But of course this is an unfair comparison; combining the
global vector with the sense-specific vector doubles the dimensionality of the
vector to 100, making comparison with the standard dimensionality (50d) unfair.
When comparing with standard (100), the conclusions become more nuanced.

For every task, the +Expectation method often seems to perform better than the
simple baseline (both for the 50d case and the 100d case). However, only some of
these differences are significant.

(1) Using multi-sense embeddings is significantly helpful for tasks like semantic
relatedness (Tables 7-8). This is sensible since sentence meaning here is
sensitive to the semantics of one particular word, which could vary with word
sense and which would directly be reflected in the relatedness score.

(2) By contrast, for sentiment analysis (Tables 5-6), much of the task depends on
correctly identifying a few sentiment words like “good” or “bad”, whose senses
tend to have similar sentiment values, and hence for which multi-sense embeddings
offer little help. Multi-sense embeddings might promise to help sentiment
analysis for some cases, like disambiguating the word “sound” in “safe and sound”
versus “movie sound”. But we suspect that such cases are not common, explaining
the non-significance of the improvement. Furthermore, the advantages of neural
models in sentiment analysis tasks presumably lie in their capability to capture
local composition like negation, and it’s not clear how helpful multi-sense
embeddings are for that aspect.

(3) Similarly, multi-sense embeddings help for POS tagging, but not for NER
tagging (Tables 3-4). Word senses have long been known to be related to POS tags.
But the largest proportion of NER tags consists of the negative not-a-NER (“O”)
tag, each of which is likely correctly labelable regardless of whether senses are
disambiguated or not (since presumably if a word is not a named entity, most of
its senses are not named entities either).

(4) As we apply more sophisticated models like LSTM to semantic relatedness tasks
(in Table 9), the advantages of multi-sense embeddings disappear.

(5) Doubling the number of dimensions is sufficient to increase performance as
much as using the complex multi-sense algorithm. (Of course increasing vector
dimensionality (to 300) boosts performance even more, although at the significant
cost of exponentially increasing time complexity.) Why do larger
one-word-one-vector embeddings do so well? We suggest some hypotheses:

• Though information about distinct senses is
encoded in one-word-one-vector embeddings
in a mixed and less structured way, we suspect
that the compositional nature of neural
models is able to separate the informational
chaff from the wheat and choose what information
to take up, bridging the gap between the
single-vector and multi-sense paradigms. For
models like LSTMs, which are better at doing
such a job by using gates to control information
flow, the difference between the two
paradigms should thus be further narrowed,
as indeed we found.

• The pipeline model proposed in the work re¬ 
quires sense-label inference (i.e., step 2). We 
proposed two strategies: GREEDY and EX¬ 
PECTATION, and found that GREEDY mod¬ 
els perform worse than EXPECTATION, as 
we might expect^. But even EXPECTATION 
can be viewed as another form of one-word- 
one-vector models, just one where different 
senses are entangled but weighted to empha¬ 
size the important ones. Again, this suggests 
another cause for the strong relative perfor¬ 
mance of larger-dimensioned one-word-one- 
vector models. 
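The contrast between the two sense-inference strategies can be made concrete with a small sketch. It assumes, hypothetically, that each word has a matrix `sense_vecs` of sense embeddings and a vector `sense_probs` of sense probabilities inferred from context. GREEDY is taken here to select the single most probable sense and EXPECTATION to return the probability-weighted average of all senses. The function names and toy values are illustrative assumptions, not the implementation used in the experiments.

```python
import numpy as np

def greedy_embedding(sense_vecs, sense_probs):
    """Return the embedding of the single most probable sense."""
    return sense_vecs[int(np.argmax(sense_probs))]

def expectation_embedding(sense_vecs, sense_probs):
    """Return the probability-weighted average of all sense embeddings."""
    probs = np.asarray(sense_probs, dtype=float)
    probs = probs / probs.sum()  # normalise defensively
    return probs @ sense_vecs

# Toy example with made-up numbers: two senses of "bank" in a 3-d space.
sense_vecs = np.array([[1.0, 0.0, 0.0],   # hypothetical 'financial institution' sense
                       [0.0, 1.0, 0.0]])  # hypothetical 'sloping land' sense
sense_probs = np.array([0.7, 0.3])        # hypothetical context-dependent probabilities

greedy_vec = greedy_embedding(sense_vecs, sense_probs)
expected_vec = expectation_embedding(sense_vecs, sense_probs)
```

With these toy values GREEDY returns the first sense vector unchanged, while EXPECTATION returns one blended vector per token, which is why it can be read as a context-weighted one-word-one-vector model.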

7 Conclusion 

In this paper, we expand ongoing research into
multi-sense embeddings by first proposing a new
version based on Chinese restaurant processes that
achieves state of the art performance on simple
word similarity matching tasks. We then introduce
a pipeline system for incorporating multi-sense
embeddings into NLP applications, and examine
multiple NLP tasks to see whether and when
multi-sense embeddings can introduce performance
boosts. Our results suggest that simply increasing
the dimensionality of baseline skip-gram embeddings
is sometimes sufficient to achieve the same
performance wins that come from using multi-sense
embeddings. That is, the most straightforward way
to yield better performance on these tasks is just
to increase embedding dimensionality.

^GREEDY models work in a more aggressive way and
likely make mistakes due to the non-global-optimum
nature and limited context information.

Our results come with some caveats. In partic¬ 
ular, our conclusions are based on the pipelined 
system that we introduce, and other multi-sense 
embedding systems (e.g., a more advanced sense 
learning model or a better sense label model or 
a completely different pipeline system) may find 
stronger effects of multi-sense models. Nonetheless,
we do consistently find improvements for
multi-sense embeddings in some tasks (part-of-speech
tagging and semantic relation identification),
suggesting the benefits of our multi-sense
models and those of others. Perhaps the most
important implication of our results may be the
evidence they provide for the importance of going
beyond simple human-matching tasks, and testing
embedding models by using them as components
in real NLP applications.

8 Appendix 

In sentiment classification and sentence semantic
relatedness tasks, classification models require
embeddings that represent the input at a sentence
or phrase level. We adopt recurrent networks
(standard ones or LSTMs) and recursive networks
in order to map a sequence of tokens of varying
length to a vector representation.


Recurrent Networks A recurrent network successively
takes word $w_t$ at step $t$, combines its vector
representation $e_t$ with the previously built hidden
vector $h_{t-1}$ from time $t-1$, calculates the
resulting current embedding $h_t$, and passes it to the
next step. The embedding $h_t$ for the current time $t$
is thus:

$$h_t = \tanh(W \cdot h_{t-1} + V \cdot e_t) \tag{5}$$

where $W$ and $V$ denote compositional matrices. If
$N_s$ denotes the length of the sequence, $h_{N_s}$
represents the whole sequence $S$.

Recursive Networks Standard recursive models
work in a similar way, operating on neighboring
words in parse tree order rather than sequence
order. They compute the representation for each
parent node based on its immediate children,
recursively in a bottom-up fashion until reaching the
root of the tree. For a given node $\eta$ in the tree
and its left child $\eta_{\text{left}}$ (with representation
$e_{\text{left}}$) and right child $\eta_{\text{right}}$ (with
representation $e_{\text{right}}$), the standard recursive
network calculates $e_\eta$:

$$e_\eta = \tanh(W \cdot e_{\text{left}} + V \cdot e_{\text{right}}) \tag{6}$$

Long Short Term Memory (LSTM) LSTM
models (Hochreiter and Schmidhuber, 1997) are
defined as follows: given a sequence of inputs
$X = \{x_1, x_2, \ldots, x_{n_X}\}$, an LSTM associates each
time step with an input, memory and output gate,
respectively denoted as $i_t$, $f_t$ and $o_t$. We
notationally disambiguate $e$ and $h$, where $e_t$ denotes
the vector for an individual text unit (e.g., word or
sentence) at time step $t$, while $h_t$ denotes the vector
computed by the LSTM model at time $t$ by
combining $e_t$ and $h_{t-1}$. $\sigma$ denotes the sigmoid
function, and $W$ is the weight matrix applied to the
concatenation of $h_{t-1}$ and $e_t$. The vector
representation $h_t$ for each time step $t$ is given by:

$$\begin{bmatrix} i_t \\ f_t \\ o_t \\ l_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} W \cdot \begin{bmatrix} h_{t-1} \\ e_t \end{bmatrix} \tag{7}$$

$$c_t = f_t \cdot c_{t-1} + i_t \cdot l_t \tag{8}$$

$$h_t = o_t \cdot c_t \tag{9}$$
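The composition functions in Eqs. (5)-(9) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the code used in the experiments: input and hidden vectors are assumed to share one dimensionality `K`, the stacked LSTM matrix is assumed to have shape `(4K, 2K)`, the products in Eqs. (8) and (9) are taken to be element-wise, the initial hidden and memory states are assumed to be zero, and no bias terms are used because the equations show none. All function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recurrent_step(W, V, h_prev, e_t):
    """Eq. (5): h_t = tanh(W h_{t-1} + V e_t)."""
    return np.tanh(W @ h_prev + V @ e_t)

def recursive_node(W, V, e_left, e_right):
    """Eq. (6): parent representation from its left and right children."""
    return np.tanh(W @ e_left + V @ e_right)

def lstm_step(W, h_prev, c_prev, e_t):
    """Eqs. (7)-(9) with a single stacked matrix W of assumed shape (4K, 2K)."""
    K = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, e_t])   # shape (4K,)
    i_t = sigmoid(z[0:K])                   # input gate
    f_t = sigmoid(z[K:2 * K])               # memory gate
    o_t = sigmoid(z[2 * K:3 * K])           # output gate
    l_t = np.tanh(z[3 * K:4 * K])           # candidate values
    c_t = f_t * c_prev + i_t * l_t          # Eq. (8)
    h_t = o_t * c_t                         # Eq. (9) as written: no tanh on c_t
    return h_t, c_t

def encode_sequence(W, V, embeddings):
    """Run the recurrent network over a sequence; the final h represents it."""
    h = np.zeros(W.shape[0])                # assumed zero initial state
    for e_t in embeddings:
        h = recurrent_step(W, V, h, e_t)
    return h

# Toy usage with random parameters (illustrative values only).
rng = np.random.default_rng(0)
K = 4
W_rec, V_rec = rng.normal(size=(K, K)), rng.normal(size=(K, K))
W_lstm = rng.normal(size=(4 * K, 2 * K))
tokens = rng.normal(size=(5, K))            # five hypothetical word vectors

sentence_vec = encode_sequence(W_rec, V_rec, tokens)
parent_vec = recursive_node(W_rec, V_rec, tokens[0], tokens[1])
h1, c1 = lstm_step(W_lstm, np.zeros(K), np.zeros(K), tokens[0])
```

In each case the final vector (`sentence_vec`, the root of the tree, or the last LSTM hidden state) is what would be passed to the downstream classifier as the sentence- or phrase-level representation.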